1 Introduction

Climate change has been an important topic for decades, as this global phenomenon has been affecting the Earth for many years. Scientists around the world try to understand global warming, its causes, and its likely future trajectory. In this project, we try to predict global temperature using time series theory. More precisely, we try to answer two main questions: is there seasonality in the temperature measurements, and what is the form of global warming? To do so, we first explore and transform the data in Data exploration, then model the global trend and its residuals using Generalized Least Squares and ARMA models in Modeling the temperature using GLS and ARMA. With this model, we can answer the two main questions (see Answering the two main questions). Additionally, we can predict future temperatures, which is done in Prediction up to 2050; in that section, we also compare the GLS model to another modeling paradigm, which uses Generalized Additive Models and SARIMA models. Lastly, we conclude by stating the take-home messages and further steps that would be interesting to explore in Conclusion.

2 Data exploration

2.1 Dataset Description

The time series which will be used in this project is the temperature anomalies from January 1880 to December 2022. A temperature anomaly is a departure from a global mean value, which is often a large-scale moving average of the planet’s temperature. A positive (negative) anomaly in this time series shows that the temperature was warmer (cooler) than the global mean.

2.2 Dataset transformations

As seen above, one can identify an increasing trend. For prediction purposes, one might want to focus on recent years only. For this reason, we chose to take a subset of the time series, from \(1960\) onwards, as one can argue that there is a change point in the trend around this year, from fluctuating to more or less linear growth.

3 Modeling the temperature using GLS and ARMA

3.1 Identifying a trend

From the above time series, one now has to find a model that could fit the increasing trend of the temperature anomalies. In this project, we chose to fit the trend using a few predictors: a linear, a quadratic and an exponential term with respect to time, starting at \(1\) in January \(1960\) and incrementing each month. Furthermore, we also added months as factors to capture monthly seasonality in the temperatures. Since most of the Earth's landmass lies in the Northern Hemisphere, the global mean is biased towards a northern climate, so some seasonal behavior may remain in the anomalies.

From this model, we can get a first idea of whether there is seasonality and whether the temperature anomalies grow exponentially. To do this, we show t-tests for each coefficient below.

Table 3.1: Summary of the basic model
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0531 0.0227 -2.34 0.0195
t 0.000588 1e-04 5.88 6.2e-09
I(t^2) 1.16e-06 1.35e-07 8.56 6.37e-17
exp(t/20) -6.21e-18 2.09e-18 -2.97 0.00303
month2 0.0146 0.0246 0.593 0.553
month3 0.0416 0.0246 1.69 0.0915
month4 -0.0114 0.0246 -0.465 0.642
month5 -0.0319 0.0246 -1.3 0.195
month6 -0.0413 0.0246 -1.68 0.0939
month7 -0.0458 0.0246 -1.86 0.0628
month8 -0.0385 0.0246 -1.57 0.118
month9 -0.0423 0.0246 -1.72 0.0861
month10 -0.0284 0.0246 -1.16 0.248
month11 -0.0274 0.0246 -1.11 0.266
month12 -0.0372 0.0246 -1.51 0.131

As one can see above, the coefficients are essentially close to zero, since the anomalies we are modeling are themselves close to zero, as seen in the histograms above. The month coefficients are non-zero but not significant at the \(5\%\) level; the closest are March (\(0.09\)), June (\(0.09\)), September (\(0.09\)) and July (\(0.06\)), with p-values in parentheses.

Furthermore, one observes that the exponential coefficient is actually negative. This is due to the last part of the time series, as seen in the fitted plot above: the trend starts decreasing around \(2020\). If we evaluate the exponential term at the last month of the time series, we get \(\exp(756/20) \approx 2.6 \times 10^{16}\), and with the negative coefficient this gives a total contribution of \(-0.161\), compared to an intercept of \(-0.05\), a linear contribution of \(0.44\) and a quadratic contribution of \(0.66\). One can therefore conclude that the exponential part of the model actually dampens the temperature anomalies, and this dampening will increase over time.
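The arithmetic behind this contribution can be checked directly; the coefficient value is taken from the summary table above, and the small discrepancy with \(-0.161\) comes from rounding the printed coefficient:

```python
import math

# Coefficient of the exp(t/20) term from the basic model summary.
b_exp = -6.21e-18
t_last = 756  # Dec 2022: t = 1 in Jan 1960, incremented monthly

exp_term = math.exp(t_last / 20)      # roughly 2.6e16
contribution = b_exp * exp_term       # total effect of the exponential part

print(f"exp(756/20) = {exp_term:.3g}")
print(f"contribution = {contribution:.3f}")
```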

However, one big downside of this linear model is that the significance tests assume the data to be iid, which is not the case here. We can avoid this problem by using Generalized Least Squares, which takes the correlation in the noise into account:

\[Y = \textbf{X}\beta + \epsilon \text{ with } \epsilon \sim \mathcal{N}(0,\Sigma)\] where \(Y\) is the temperature anomaly, \(X\) is the data matrix (containing the time, its transformations and dummy variables for every month), and \(\epsilon\) is the noise following a normal distribution with a covariance \(\Sigma \in \mathbb{R}^{N \times N}\), which makes the noise possibly dependent between the \(N\) observations. If we know the covariance matrix \(\Sigma\), and if it is invertible, one could define the new problem \[\tilde{Y} = \tilde{\textbf{X}}\beta + \tilde{\epsilon} \text{ with } \tilde{\epsilon} \sim \mathcal{N}(0,I_{N})\] where \(\tilde{Y} = \Sigma^{-\frac{1}{2}}Y\), \(\tilde{\mathbf{X}} = \Sigma^{-\frac{1}{2}}\mathbf{X}\), \(\tilde{\epsilon} = \Sigma^{-\frac{1}{2}}\epsilon\), which enables us to treat this new data as iid. The goal is therefore to find this matrix \(\Sigma\). To do so, we will use time series theory to model the residuals as an ARMA model, and give this information to a Generalized Least Squares model. This will enable us to use the significance tests and answer the two main questions.
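The whitening transformation above can be sketched numerically. The following is a minimal numpy illustration, not the report's actual fit: it uses a toy two-column design matrix and an AR(1)-style covariance as a stand-in for the full design and the ARMA(2,2) covariance, and whitens with the inverse Cholesky factor of \(\Sigma\) (any square root of \(\Sigma^{-1}\) works equally well):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

# Toy design: intercept + linear trend (stand-in for the full X
# with quadratic, exponential and month terms).
t = np.arange(1, N + 1)
X = np.column_stack([np.ones(N), t])
beta_true = np.array([-0.05, 0.002])

# AR(1)-style covariance Sigma_ij = rho^|i-j| as a stand-in for
# the ARMA(2,2) covariance estimated in the report.
rho = 0.7
Sigma = rho ** np.abs(np.subtract.outer(t, t))
L = np.linalg.cholesky(Sigma)          # Sigma = L L'
y = X @ beta_true + L @ rng.standard_normal(N)

# GLS via whitening: multiply by L^{-1} (a square root of Sigma^{-1}),
# then run ordinary least squares on the transformed problem.
Linv = np.linalg.inv(L)
X_t, y_t = Linv @ X, Linv @ y
beta_gls, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)
print(beta_gls)  # close to beta_true
```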

3.2 Finding the ARMA model for the residuals

From the initial fitted linear model described in Identifying a trend, we obtained residuals to which we can fit an ARMA model, hence finding the covariance matrix. To do so, we plot the ACF and PACF and look for relevant orders for the \(AR(p)\) and \(MA(q)\) parts of the model.

From the above plots, we clearly identify \(p=2\) for the AR part, as there is a clear cut-off of the PACF after the second lag. For \(q\) it is less clear-cut, but one could argue that after the second lag there is a small drop from \(0.55\) to less than \(0.4\); we therefore use \(q=2\). No additional seasonality is found at lags that are multiples of \(12\) (the number of months), which means \(P,Q=0\). One can now look at various submodels of the \(ARMA(p=2,q=2)\) and compare AIC and predictive checking to find the best model.
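The identification logic can be illustrated on simulated data. This is a hedged sketch, not the report's residual series: it computes the sample ACF and, via the Durbin-Levinson recursion, the sample PACF of a simulated AR(2), for which the PACF should cut off after lag 2:

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelations r_0..r_nlags."""
    x = x - x.mean()
    c0 = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:-k or None], x[k:]) / (len(x) * c0)
                     for k in range(nlags + 1)])

def pacf(x, nlags):
    """Durbin-Levinson recursion: the lag-k PACF is the last
    coefficient of the order-k best linear predictor."""
    r = acf(x, nlags)
    phi = np.zeros((nlags + 1, nlags + 1))
    out = [1.0]
    for k in range(1, nlags + 1):
        num = r[k] - sum(phi[k - 1, j] * r[k - j] for j in range(1, k))
        den = 1 - sum(phi[k - 1, j] * r[j] for j in range(1, k))
        phi[k, k] = num / den
        for j in range(1, k):
            phi[k, j] = phi[k - 1, j] - phi[k, k] * phi[k - 1, k - j]
        out.append(phi[k, k])
    return np.array(out)

# Simulate an AR(2): x_t = 0.6 x_{t-1} + 0.3 x_{t-2} + w_t.
rng = np.random.default_rng(1)
n = 5000
x = np.zeros(n)
w = rng.standard_normal(n)
for i in range(2, n):
    x[i] = 0.6 * x[i - 1] + 0.3 * x[i - 2] + w[i]

p = pacf(x, 6)
print(np.round(p, 2))  # lags 1-2 large, lags >= 3 near zero
```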

Table 3.2: Comparison of submodels
Model AIC Predictive.error
ARMA(2,2) -1270.616 0.0329813
ARMA(2,1) -1266.817 0.0333128
ARMA(1,2) -1265.299 0.0333152

From the above comparison, one finds that the \(ARMA(p=2,q=2)\) is better than its submodels, both in AIC and in predictive checking. Furthermore, we can check the residuals of the fitted ARMA model and assess their normality:
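The AIC comparison between nested orders can be sketched in a few lines. The following uses a conditional least-squares AR fit and the Gaussian conditional log-likelihood, a simplification of the full ML fit an arima-style routine would use, on simulated AR(2) data rather than the report's residuals:

```python
import numpy as np

def fit_ar_aic(x, p):
    """Conditional least-squares AR(p) fit; AIC = 2k - 2 log L
    from the Gaussian conditional log-likelihood."""
    n = len(x)
    X = np.column_stack([x[p - k - 1:n - k - 1] for k in range(p)])
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid.var()
    m = n - p
    loglik = -0.5 * m * (np.log(2 * np.pi * sigma2) + 1)
    k = p + 1  # AR coefficients + innovation variance
    return 2 * k - 2 * loglik

# Simulate an AR(2) and compare candidate orders by AIC.
rng = np.random.default_rng(2)
n = 2000
x = np.zeros(n)
w = rng.standard_normal(n)
for i in range(2, n):
    x[i] = 0.6 * x[i - 1] + 0.3 * x[i - 2] + w[i]

aics = {p: fit_ar_aic(x, p) for p in (1, 2, 3)}
print(aics)  # the true order p=2 should score (near-)lowest
```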

Except for the upper tail, the residuals are close to the diagonal on the Q-Q plot, and no more lags stand out in the ACF-PACF plots. This indicates that the ARMA(2,2) process explains the residuals of the linear model well. We can now use this structure as our covariance matrix, as seen in the following section.

3.3 Fitting the GLS model and answering the two main questions

Now that we have a model for the trend (see Identifying a trend) and for its residuals (see Finding the ARMA model for the residuals), one can define the GLS model as described above; it is written explicitly here:

\[ \tilde{Y} = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 \exp(t/20) + \beta_{4}M + \epsilon\] with \(\epsilon \sim ARMA(p=2,q=2)\) and \(M\) being the month factor (\(\beta_4\) is therefore a vector in \(\mathbb{R}^{11}\), January being the baseline level).
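The design matrix behind this specification can be sketched explicitly; taking January as the baseline level is an assumption consistent with the `month2`..`month12` rows of the summary tables:

```python
import numpy as np

# Design matrix for monthly data from Jan 1960 to Dec 2022:
# intercept, t, t^2, exp(t/20), and 11 month dummies (January baseline).
n_months = 756  # 63 years * 12 months
t = np.arange(1, n_months + 1, dtype=float)
month = (np.arange(n_months) % 12) + 1  # 1 = January, ..., 12 = December

dummies = np.column_stack([(month == m).astype(float) for m in range(2, 13)])
X = np.column_stack([np.ones(n_months), t, t**2, np.exp(t / 20), dummies])

print(X.shape)  # (756, 15): intercept + 3 trend terms + 11 month dummies
```

The 15 columns match the 15 coefficient rows reported in the model summaries.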

4 Answering the two main questions

To answer the two questions, we first start by showing the summary of the fitted GLS model :

Table 4.1: Summary of the GLS model
Value Std.Error t-value p-value
(Intercept) -0.0494 0.0443 -1.11 0.266
t 0.000554 0.000276 2.01 0.0447
I(t^2) 1.2e-06 3.72e-07 3.24 0.00127
exp(t/20) -7.53e-18 4.93e-18 -1.53 0.127
month2 0.0151 0.0154 0.983 0.326
month3 0.0419 0.0163 2.57 0.0103
month4 -0.0107 0.0185 -0.577 0.564
month5 -0.0313 0.0187 -1.67 0.0947
month6 -0.0402 0.0198 -2.03 0.0423
month7 -0.045 0.0194 -2.32 0.0208
month8 -0.0372 0.0198 -1.88 0.0607
month9 -0.0412 0.0187 -2.2 0.0282
month10 -0.0267 0.0185 -1.44 0.15
month11 -0.026 0.0163 -1.59 0.112
month12 -0.0349 0.0155 -2.26 0.0241
Table 4.2: Type II ANOVA test of the GLS model
Df Chisq Pr(>Chisq)
t 1 4.043 0.044
I(t^2) 1 10.467 0.001
exp(t/20) 1 2.329 0.127
month 11 34.502 0.000

From the above summary, we observe results similar to those of the basic linear model, except for the significance of the exponential coefficient, which now has a p-value of \(0.127\). Thanks to this summary, we can now answer the two main questions:

First, we still observe the negative effect of the exponential term, with a coefficient of \(-7.53\times 10^{-18}\), which is very small because the exponential itself grows rapidly with time. One can therefore refute a positive exponential trend in the temperature anomalies, as the exponential term dampens the linear and quadratic parts of the model rather than reinforcing them. The quadratic part mainly acts at the start of the series around \(1960\), as one can observe in the previous fitted trends.

Regarding seasonality, the type II ANOVA test shows that the month factor is strongly significant, meaning that there is monthly seasonality in the temperatures. As noted earlier, this could be due to there being more land in the Northern Hemisphere than in the Southern.

5 Prediction up to \(2050\)

First, one observes that if we kept the exponential term as is, the negative exponential would explode after a few years, which is realistically very unlikely. We therefore remove the exponential term from the original model; our model becomes:

\[\tilde{Y} = \beta_0 + \beta_1 t + \beta_2 t^2 + \epsilon \quad \text{ with } \epsilon \sim ARMA(p=2,q=2)\] with coefficients :
Table 5.1: Summary of the GLS model without exponential term
Value Std.Error t-value p-value
(Intercept) -0.064 0.0441 -1.45 0.147
t 0.000716 0.00026 2.76 0.00599
I(t^2) 9.27e-07 3.32e-07 2.79 0.0054
month2 0.0152 0.0154 0.991 0.322
month3 0.0422 0.0163 2.59 0.00976
month4 -0.0104 0.0185 -0.562 0.574
month5 -0.031 0.0187 -1.66 0.0978
month6 -0.04 0.0198 -2.02 0.0434
month7 -0.0448 0.0194 -2.31 0.0212
month8 -0.0371 0.0198 -1.88 0.0609
month9 -0.0412 0.0187 -2.2 0.0279
month10 -0.0269 0.0185 -1.46 0.146
month11 -0.0264 0.0163 -1.62 0.106
month12 -0.0356 0.0154 -2.3 0.0215

Thanks to our model, we can now try predicting the temperature anomalies up to \(2050\):

As seen above, we observe an upward trend with a seasonal effect. One can also observe that the noise is barely visible in these predictions. This is due to the large time frame used here: when forecasting far beyond the observed values, the prediction of the noise tends to its mean, which is zero. Therefore, the only component visible at such horizons is the fitted trend. By removing the exponential term, we no longer observe the negative dampening seen earlier.
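The shrinking of the noise forecast toward zero can be sketched with the AR forecast recursion: future innovations are replaced by their mean, so the forecast decays geometrically. The coefficients below are illustrative, not the fitted ones:

```python
import numpy as np

# Multi-step forecasts of a stationary zero-mean AR(2) noise process:
# iterate the recursion with future innovations set to their mean (zero).
phi1, phi2 = 0.6, 0.3      # illustrative AR coefficients
x_prev2, x_prev1 = 0.8, 0.5  # (x_{T-1}, x_T): last observed residuals

forecasts = []
for h in range(1, 121):    # 10 years of monthly steps
    x_next = phi1 * x_prev1 + phi2 * x_prev2
    forecasts.append(x_next)
    x_prev2, x_prev1 = x_prev1, x_next

print(round(forecasts[0], 3), round(forecasts[-1], 6))
# the forecast shrinks toward the process mean (zero) as the horizon grows
```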

5.1 Comparison with GAM trend and SARIMA

In this project, we have focused on GLS models in order to obtain independence in the data and apply significance tests to answer our two main questions. However, one could be interested only in the prediction and not in the significance of seasonality or of an exponential trend. This is why, instead of working with Generalized Least Squares, one can try other methods to predict the temperature anomalies.

The method presented here consists of fitting the trend with a Generalized Additive Model and then finding SARIMA parameters for its residuals. This section is additional and only gives the reader another way to forecast temperature anomalies, without assessing the two questions. The method closely resembles the one previously presented and many analyses are analogous to previous ones, hence the short description of certain results in this section.

First, we start by defining a Generalized Additive Model for trend fitting. A generalized additive model is a generalized linear model that allows non-linear relationships between the variables and the response: the response is given by a sum of smooth functions (usually built from spline bases) of the variables, as well as any other GLM-compatible terms. For further details, see Hastie and Tibshirani (1986). In our case, we use a GAM with a smoothing function on the time \(t\), and we also add the months as factors to capture the monthly seasonality.
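The additive-smooth idea can be sketched with an unpenalized regression spline, a bare-bones stand-in for the penalized smooths a full GAM implementation would use; data, knots and basis choice here are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
t = np.linspace(0, 10, n)
# Toy target: smooth trend plus noise (noise sd = 0.1).
y = 0.02 * t**2 + 0.1 * np.sin(t) + 0.1 * rng.standard_normal(n)

# Truncated-power cubic spline basis: 1, t, t^2, t^3 plus (t - k)_+^3
# at interior knots -- an unpenalized stand-in for a GAM's smooth s(t).
knots = np.linspace(1, 9, 8)
B = np.column_stack([t**d for d in range(4)] +
                    [np.clip(t - k, 0, None) ** 3 for k in knots])

coef, *_ = np.linalg.lstsq(B, y, rcond=None)
fitted = B @ coef
rmse = np.sqrt(np.mean((y - fitted) ** 2))
print(round(rmse, 3))  # roughly the noise level, i.e. the smooth captures the trend
```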

Table 5.2: Summary of the basic model

From the residuals, we can apply the same type of analysis as previously done with the linear model residuals in Identifying a trend. Below are the ACF-PACF plots of the residuals:

We first observe that these closely resemble the plots studied in the previous residual analysis. From the ACF plot, we observe an exponential decay that goes below the significance threshold after lag \(5\)-\(6\), but no real cut-off. From the PACF, we observe a cut-off after lag \(2\). One could therefore take \(p=2\) for the AR part of the ARMA model; there is no clear value for \(q\), even though we see an important decrease (yet no cut-off) at \(q=2\).

As before, the ACF and PACF of the fitted model's residuals look consistent with white noise, and the Q-Q plots seem reasonable. One can now look at submodels of the ARMA(2,2) and check whether lower orders for the AR and MA components make sense.

Table 5.3: Comparison of submodels
Model AIC Predictive.error
ARMA(2,2) -1276.773 0.0323778
ARMA(2,1) -1274.074 0.0326328
ARMA(1,2) -1271.360 0.0326824

As previously observed, we still obtain the best results for the ARMA(p=2,q=2) model. We can now look further into yearly seasonality and differencing, by plotting the ACF and PACF with yearly lags in red.

Table 5.4: Comparison with or without P-seasonality
Model AIC Predictive.error
ARMA(p=2,q=2) -1276.773 0.0323778
SARIMA(2,0,2)x(2,0,0) -1278.698 0.0322291
SARIMA(2,0,2)x(1,0,0) -1275.818 0.0324311

Even though it is not obvious from the ACF-PACF plots above, one can try adding dependence on lags \(12\) and \(24\) (in months) suggested by the PACF plot, and check whether this yields better AIC and predictive-checking results than the basic ARMA(p=2,q=2) model, as it is not computationally expensive to do so. One finds that the SARIMA(2,0,2)x(2,0,0)[12] (which is the one we keep) yields better results than the SARIMA(2,0,2)x(1,0,0) and the ARMA(2,2).

Here we observe that the remaining residuals of the fitted SARIMA model are white noise, as no lag stands out in the (P)ACF plots and the Q-Q plots look normal. We can now compare this model to the one obtained by calling the auto.arima function on the GAM residuals, comparing AIC and predictive error.

Table 5.5: Comparison Autoarima vs SARIMA(2,0,2)x(2,0,0)
Model AIC Predictive.error
Autofit -1277.461 0.0325240
SARIMA(2,0,2)x(2,0,0) -1278.698 0.0322291

Lastly, calling auto.arima on the GAM residuals gives an ARIMA(2,0,3)(1,0,0)[12] with zero mean, which is slightly different from what we found. However, our SARIMA(2,0,2)x(2,0,0)[12] achieves both a lower AIC and a lower predictive error.

Finally, we can try predicting with this model, by adding the trend model to the noise forecast from the fitted SARIMA model.

This is quite interesting, as it is completely different from the GLS prediction seen above. First, the GAM model predicts a slow decrease of the temperature anomalies up to \(2050\). We also see the SARIMA model at work in the noise fitting during the first few years of the prediction, before it returns to approximately zero as the horizon grows (as it should). This essentially means that, depending on the underlying trend model we choose, we can obtain completely different long-term trends, which is striking.

6 Conclusion

In conclusion, the goal of this project was to understand temperature anomalies from a time series perspective and to answer two main questions: whether there is seasonality in the temperature anomalies, and whether we can refute an exponential trend in those anomalies. To do so, we first fitted a trend to the monthly temperature anomalies from \(1960\) onwards. By creating a linear model containing a linear, a quadratic and an exponential component with respect to time, plus month factors, one could try to apply statistical tests to the fitted coefficients and conclude from them. However, this requires the data to be iid, which is not the case in this project. Therefore, we had to use GLS models and fit an ARMA model to the residuals. Once this was done, we could answer the two main questions using significance testing, which is now valid thanks to the transformation with the estimated noise covariance matrix. We answered negatively regarding an exponential trend in the temperature anomalies and positively regarding monthly seasonality.

Additionally, we explored another way to predict the temperature anomalies using GAM and SARIMA models, with methods analogous to those used for the GLS prediction. We observed different predictions up to \(2050\), which is interesting. Further steps would be to apply other methods to this problem, such as spectral methods to find ARIMA parameters, as well as linking the temperature anomalies to \(CO_2\) emissions and perhaps predicting the former from the latter in a larger time series model.